Assignment 2 CSCN8000 Artificial Intelligence Algorithms and Mathematics Sudhan Shrestha - 8889436
Use iris flower dataset from sklearn library and try to form clusters of flowers using petal width and length features. Drop the other two features for simplicity.
Figure out if any preprocessing such as scaling would help here
Draw elbow plot and from that figure out optimal value of k
# importing various libraries and modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import zscore
from scipy import stats
from sklearn.metrics import accuracy_score
# loading the iris dataset
iris = load_iris(as_frame=True)
iris
{'data': sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
[150 rows x 4 columns],
'target': 0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: target, Length: 150, dtype: int32,
'frame': sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
.. ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8
target
0 0
1 0
2 0
3 0
4 0
.. ...
145 2
146 2
147 2
148 2
149 2
[150 rows x 5 columns],
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n\n**Data Set Characteristics:**\n\n :Number of Instances: 150 (50 in each of three classes)\n :Number of Attributes: 4 numeric, predictive attributes and the class\n :Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm\n - class:\n - Iris-Setosa\n - Iris-Versicolour\n - Iris-Virginica\n \n :Summary Statistics:\n\n ============== ==== ==== ======= ===== ====================\n Min Max Mean SD Class Correlation\n ============== ==== ==== ======= ===== ====================\n sepal length: 4.3 7.9 5.84 0.83 0.7826\n sepal width: 2.0 4.4 3.05 0.43 -0.4194\n petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)\n petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)\n ============== ==== ==== ======= ===== ====================\n\n :Missing Attribute Values: None\n :Class Distribution: 33.3% for each of 3 classes.\n :Creator: R.A. Fisher\n :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n :Date: July, 1988\n\nThe famous Iris database, first used by Sir R.A. Fisher. The dataset is taken\nfrom Fisher\'s paper. Note that it\'s the same as in R, but not as in the UCI\nMachine Learning Repository, which has two wrong data points.\n\nThis is perhaps the best known database to be found in the\npattern recognition literature. Fisher\'s paper is a classic in the field and\nis referenced frequently to this day. (See Duda & Hart, for example.) The\ndata set contains 3 classes of 50 instances each, where each class refers to a\ntype of iris plant. One class is linearly separable from the other 2; the\nlatter are NOT linearly separable from each other.\n\n.. topic:: References\n\n - Fisher, R.A. "The use of multiple measurements in taxonomic problems"\n Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to\n Mathematical Statistics" (John Wiley, NY, 1950).\n - Duda, R.O., & Hart, P.E. 
(1973) Pattern Classification and Scene Analysis.\n (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.\n - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System\n Structure and Classification Rule for Recognition in Partially Exposed\n Environments". IEEE Transactions on Pattern Analysis and Machine\n Intelligence, Vol. PAMI-2, No. 1, 67-71.\n - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions\n on Information Theory, May 1972, 431-433.\n - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II\n conceptual clustering system finds 3 classes in the data.\n - Many, many more ...',
'feature_names': ['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
# assigning the `iris.data` to the variable `X` and `iris.target` to the variable `y`.
X = iris.data
y = iris.target
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
Shape of X: (150, 4)
Shape of y: (150,)
X
|  | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |
150 rows × 4 columns
y
0 0
1 0
2 0
3 0
4 0
..
145 2
146 2
147 2
148 2
149 2
Name: target, Length: 150, dtype: int32
# dropping the columns 'sepal length (cm)' and 'sepal width (cm)' from the DataFrame `X`.
# The `axis=1` parameter specifies that the columns should be dropped.
X = X.drop(['sepal length (cm)','sepal width (cm)'],axis=1)
X
|  | petal length (cm) | petal width (cm) |
|---|---|---|
| 0 | 1.4 | 0.2 |
| 1 | 1.4 | 0.2 |
| 2 | 1.3 | 0.2 |
| 3 | 1.5 | 0.2 |
| 4 | 1.4 | 0.2 |
| ... | ... | ... |
| 145 | 5.2 | 2.3 |
| 146 | 5.0 | 1.9 |
| 147 | 5.2 | 2.0 |
| 148 | 5.4 | 2.3 |
| 149 | 5.1 | 1.8 |
150 rows × 2 columns
# defining StandardScaler and scaling X
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
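As a sanity check, `StandardScaler` transforms each column to zero mean and unit variance; a minimal sketch on toy values (the numbers below are illustrative, not the iris data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 2-column array standing in for the petal features (values are illustrative).
X_demo = np.array([[1.4, 0.2],
                   [4.5, 1.5],
                   [5.8, 2.2]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Each column now has mean ~0 and (population) std ~1.
print(np.allclose(X_demo_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_demo_scaled.std(axis=0), 1.0))   # True
```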
from sklearn.metrics import silhouette_score
# performing K-means clustering (with an arbitrary k = 5, chosen before the elbow analysis) and calculating the silhouette score to compare clustering quality with and without scaling.
# on un-scaled data
kmeans = KMeans(n_clusters=5, random_state=23)
cluster_labels = kmeans.fit_predict(X)
silhouette_no_scaling = silhouette_score(X, cluster_labels)
# on scaled data
kmeans_scaled = KMeans(n_clusters=5, random_state=23)
cluster_labels_scaled = kmeans_scaled.fit_predict(X_scaled)
silhouette_scaling = silhouette_score(X_scaled, cluster_labels_scaled)
print(f"Silhouette Score without scaling: {silhouette_no_scaling:.2f}")
print(f"Silhouette Score with scaling: {silhouette_scaling:.2f}")
Silhouette Score without scaling: 0.59
Silhouette Score with scaling: 0.57
The silhouette score without scaling (0.59) is slightly higher than with scaling (0.57), so clustering is marginally better on the raw data. The difference is small, which is unsurprising: petal length and petal width are both measured in centimetres and already have comparable ranges, so standardization changes little here.
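The single-k comparison above can also be swept over several values of k; a sketch on synthetic 2-D blobs (the data and cluster centres here are made up for illustration) shows the silhouette score peaking at the true number of clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(23)
# Three well-separated synthetic blobs standing in for the two petal features.
centers = [(1, 0), (4, 1), (6, 2)]
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers])

# Silhouette score for each candidate k; it should peak at k = 3 here.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=23).fit_predict(X_demo)
    print(k, round(silhouette_score(X_demo, labels), 2))
```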
import warnings
warnings.filterwarnings('ignore')
# performing the K-means clustering and generating an elbow plot to determine the optimal number of clusters for the dataset.
wcss = []
max_clusters = 10
for k in range(1, max_clusters + 1):
    kmeans = KMeans(n_clusters=k, random_state=23)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the elbow plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, max_clusters + 1), wcss, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('The Elbow Plot')
plt.grid(True)
plt.show()
From the elbow plot, the WCSS curve bends sharply at k = 3 and flattens afterwards, so the optimal number of clusters is k = 3 (matching the three iris species).
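With k = 3 fixed, a final model can be fitted and inspected; a minimal sketch, reloading the two petal features from `load_iris`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
X_petal = iris.data[['petal length (cm)', 'petal width (cm)']]

km = KMeans(n_clusters=3, n_init=10, random_state=23)
labels = km.fit_predict(X_petal)

# With three clusters the sizes roughly match the three species (50 each).
print(np.bincount(labels))
print(km.cluster_centers_.round(2))
```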
Use the heart dataset from the Resources Folder or access it from https://www.kaggle.com/fedesoriano/heart-failure-prediction
Load heart disease dataset in pandas dataframe
Remove outliers using the Z-score. The usual guideline is to drop any row with a Z-score > 3 or < -3.
Convert text columns to numbers using label encoding / one hot encoding
Apply scaling
Build a classification model using various methods (SVM, logistic regression, random forest) and check which model gives you the best accuracy
Now use PCA to reduce dimensions, retrain your model and see its impact on your model in terms of accuracy.
# reading a CSV file named 'heart.csv' and storing its contents in a pandas DataFrame called `df_heart`.
df_heart = pd.read_csv('csv/heart.csv')
df_heart.head()
|  | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
df_heart.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Age             918 non-null    int64
 1   Sex             918 non-null    object
 2   ChestPainType   918 non-null    object
 3   RestingBP       918 non-null    int64
 4   Cholesterol     918 non-null    int64
 5   FastingBS       918 non-null    int64
 6   RestingECG      918 non-null    object
 7   MaxHR           918 non-null    int64
 8   ExerciseAngina  918 non-null    object
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object
 11  HeartDisease    918 non-null    int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
df_heart.describe()
|  | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|
| count | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 | 918.000000 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.066570 | 0.497414 |
| min | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 60.000000 | -2.600000 | 0.000000 |
| 25% | 47.000000 | 120.000000 | 173.250000 | 0.000000 | 120.000000 | 0.000000 | 0.000000 |
| 50% | 54.000000 | 130.000000 | 223.000000 | 0.000000 | 138.000000 | 0.600000 | 1.000000 |
| 75% | 60.000000 | 140.000000 | 267.000000 | 0.000000 | 156.000000 | 1.500000 | 1.000000 |
| max | 77.000000 | 200.000000 | 603.000000 | 1.000000 | 202.000000 | 6.200000 | 1.000000 |
df_heart.shape
(918, 12)
# box plot for the dataset
plt.figure(figsize=(15,10))
df_num = df_heart.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()
plt.show()
# performing outlier removal using z-scores.
df_without_outlier_zscore = df_heart.copy()
z_scores = np.abs(stats.zscore(df_without_outlier_zscore.select_dtypes(include=['int64', 'float64'])))
df_without_outlier_zscore = df_without_outlier_zscore[(z_scores < 3).all(axis=1)]
df_without_outlier_zscore.shape
(899, 12)
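A tiny worked example of the same rule (all numbers are made up): each value's z-score is its distance from the column mean in standard deviations, and rows with |z| ≥ 3 are dropped.

```python
import numpy as np
from scipy import stats

# 30 ordinary readings plus one extreme value (illustrative data).
vals = np.r_[np.full(30, 10.0), 100.0]

z = np.abs(stats.zscore(vals))
kept = vals[z < 3]

print(z.max().round(2))  # the outlier's z-score, well above 3
print(len(kept))         # 30 -- only the outlier was dropped
```

Note that with very few rows a single outlier inflates the standard deviation enough to hide itself, which is why the rule works best on reasonably large samples like this dataset's 918 rows.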
# box plot for the dataset after outlier removal
plt.figure(figsize=(15,10))
df_num = df_without_outlier_zscore.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()
plt.show()
# creating a copy of the dataframe
df = df_without_outlier_zscore.copy()
df.head()
|  | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
# label-encoding each text column to integer codes (fit_transform refits the encoder for each column)
labelencoder = LabelEncoder()
df['ChestPainType'] = labelencoder.fit_transform(df['ChestPainType'])
df['RestingECG'] = labelencoder.fit_transform(df['RestingECG'])
df['ST_Slope'] = labelencoder.fit_transform(df['ST_Slope'])
df['ExerciseAngina'] = labelencoder.fit_transform(df['ExerciseAngina'])
df['Sex'] = labelencoder.fit_transform(df['Sex'])
df.head()
|  | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 1 | 1 | 140 | 289 | 0 | 1 | 172 | 0 | 0.0 | 2 | 0 |
| 1 | 49 | 0 | 2 | 160 | 180 | 0 | 1 | 156 | 0 | 1.0 | 1 | 1 |
| 2 | 37 | 1 | 1 | 130 | 283 | 0 | 2 | 98 | 0 | 0.0 | 2 | 0 |
| 3 | 48 | 0 | 0 | 138 | 214 | 0 | 1 | 108 | 1 | 1.5 | 1 | 1 |
| 4 | 54 | 1 | 2 | 150 | 195 | 0 | 1 | 122 | 0 | 0.0 | 2 | 0 |
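The task also allows one-hot encoding. Since `LabelEncoder` imposes an arbitrary order on the categories (e.g. ASY=0 < ATA=1 < NAP=2), a one-hot alternative via `pd.get_dummies` can be sketched on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame mimicking two of the text columns.
demo = pd.DataFrame({'Sex': ['M', 'F', 'M'],
                     'ChestPainType': ['ATA', 'NAP', 'ASY']})

# One 0/1 column per category, with no implied ordering between categories.
encoded = pd.get_dummies(demo, columns=['Sex', 'ChestPainType'])
print(sorted(encoded.columns))
# ['ChestPainType_ASY', 'ChestPainType_ATA', 'ChestPainType_NAP', 'Sex_F', 'Sex_M']
```

For tree-based models integer codes are usually fine, but for distance- or margin-based models (SVM, logistic regression) one-hot encoding is the safer default.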
# performing scaling and separating the data for a machine learning model.
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']
scaler = StandardScaler()
X = scaler.fit_transform(X)
# splitting the data into training and testing sets.
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=36)
from sklearn.model_selection import cross_val_score
# dictionary containing the various models.
models = {
"Logistic Regression": LogisticRegression(),
"SVC": SVC(kernel = 'rbf'),
"Random Forest Classifier": RandomForestClassifier(),
}
# iterating over each model in the `models` dictionary.
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"The accuracy of {model_name} model: {acc*100:.2f}%")
The accuracy of Logistic Regression model: 85.56%
The accuracy of SVC model: 86.67%
The accuracy of Random Forest Classifier model: 86.11%
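`cross_val_score` was imported above but never used. A single 80/20 split can be lucky or unlucky, so a k-fold sketch gives a more robust comparison; synthetic data stands in here, since the snippet only illustrates the call:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data standing in for the scaled heart features.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=36)

# 5-fold cross-validation: five train/test splits, five accuracy scores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```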
from sklearn.decomposition import PCA
# Applying PCA, keeping all components (this rotates the feature space but does not yet reduce dimensionality)
n = X.shape[1]
pca = PCA(n_components=n)
X_pca = pca.fit_transform(X)
# splitting the PCA-transformed data into training and testing sets
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2, random_state=36)
# re-calculating the accuracy of all the models
models = {
"Logistic Regression": LogisticRegression(),
"SVC": SVC(kernel = 'rbf'),
"Random Forest Classifier": RandomForestClassifier(),
}
for model_name, model in models.items():
    model.fit(X_train_pca, y_train_pca)
    y_pred_pca = model.predict(X_test_pca)
    acc_pca = accuracy_score(y_test_pca, y_pred_pca)
    print(f"The accuracy of {model_name} model: {acc_pca*100:.2f}%")
The accuracy of Logistic Regression model: 85.56%
The accuracy of SVC model: 86.67%
The accuracy of Random Forest Classifier model: 81.67%
PCA gives no further improvement here: the Logistic Regression and SVC accuracies are unchanged (85.56% and 86.67%), while the Random Forest accuracy drops from 86.11% to 81.67%. This is expected, since all components were kept (n_components equals the number of features), so PCA merely rotated the feature space without discarding any information; tree-based models split on individual axes and can be sensitive to such rotations.
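To make PCA actually reduce dimensions, the number of components can be chosen from the cumulative explained variance; a sketch on synthetic correlated data (all values below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(36)
# 3 independent latent factors plus 4 noisy linear combinations of them,
# standing in for correlated clinical features (all synthetic).
base = rng.normal(size=(200, 3))
extra = base @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(200, 4))
X_demo = StandardScaler().fit_transform(np.hstack([base, extra]))

pca = PCA().fit(X_demo)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components covering ~95% of the variance.
n_keep = int(np.searchsorted(cumvar, 0.95) + 1)
print(n_keep, "of", X_demo.shape[1], "components")
```

Retraining on only those components would then test whether the discarded directions carried signal or mostly noise.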